library(tidyverse)
library(here)
library(janitor)
library(ggeffects)
library(performance)
library(naniar)
library(flextable)
library(car)
library(broom)
library(corrplot)
library(AICcmodavg)
library(GGally)
library(MuMIn)
Introduction: Introduction:
The analysis conducted in this study focuses on Sarracenia, a genus of carnivorous plants commonly known as pitcher plants. Sarracenia plants are known for their unique pitcher-shaped leaves that capture and digest insects, enabling them to obtain nutrients from the prey.
Understanding Sarracenia plants and their characteristics is of interest to various groups of people. Botanists and ecologists study these plants to gain insights into their ecological role, adaptations, and interactions with other organisms. Conservationists are interested in preserving pitcher plant habitats and ensuring the survival of these unique plant species. Furthermore, Sarracenia plants have also gained popularity among horticultural enthusiasts and collectors.
In this analysis, we aim to predict the individual biomass of Sarracenia plants using morphological, physiological, and taxonomic characteristics. Predicting biomass can provide valuable information about plant growth and productivity, which is important for understanding plant ecology, nutrient dynamics, and ecosystem functioning. Additionally, predicting biomass can aid in assessing the health and condition of pitcher plant populations, enabling conservation efforts and management strategies to be implemented effectively.
The main questions addressed in this analysis include: What factors influence the individual biomass of Sarracenia plants? Which morphological, physiological, and taxonomic characteristics are most strongly associated with biomass? Are there significant differences in biomass among different species or feeding levels of Sarracenia plants?
To answer these questions, we formulate the following hypotheses:
Morphological characteristics such as the number of leaves and phyllodia are positively correlated with the biomass of Sarracenia plants. Physiological traits such as specific leaf area (SLA) and chlorophyll content are positively associated with biomass. Taxonomic characteristics, such as the species of Sarracenia, may influence biomass, with certain species exhibiting higher biomass than others. Feeding level, which indicates the amount of prey captured and digested by the pitcher plant, may be positively correlated with biomass. By analyzing a dataset of Sarracenia plants and conducting various statistical models, we aim to identify the key predictors of individual biomass and gain insights into the factors driving the growth and productivity of these fascinating carnivorous plants.
# Read the CSV file and store it in the 'plant' variable
plant <- read_csv(here("data", "hf109-01-sarracenia.csv")) |>
# Clean the column names for clarity
clean_names() |>
# Select the columns of interest
select(
totmass,
species,
feedlevel,
sla,
chlorophyll,
amass,
num_lvs,
num_phylls
)
#check for missing data
gg_miss_var(plant)
plant_subset <- plant|>
#create better plant ibject
drop_na(sla, chlorophyll, amass, num_lvs, num_phylls)
# Calculate Pearson's correlation coefficient
plant_cor <- plant_subset |>
select(feedlevel:num_phylls) |>
cor(method = "pearson")
# Create a correlation plot
corrplot(plant_cor,
method = "ellipse",
addCoef.col= "black")
#Creates pairs plot
plant_subset |>
select(species:num_phylls) |>
ggpairs()
#linear models
null_m <- lm(totmass ~ 1, data= plant_subset)
full_m <- lm(totmass ~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls, data= plant_subset)
#plotting model
par(mfrow = c(2,2))
plot(full_m)
#checking assumptions
check_normality(full_m)
## Warning: Non-normality of residuals detected (p < .001).
check_heteroscedasticity(full_m)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
#transforming models and plotting
null_l <- lm(log(totmass)~1, data = plant_subset)
full_l <- lm(log(totmass)~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls, data= plant_subset)
plot(full_l)
check_normality(full_l)
## OK: residuals appear as normally distributed (p = 0.107).
check_heteroscedasticity(full_l)
## OK: Error variance appears to be homoscedastic (p = 0.071).
#better models
model_1 <- lm(log(totmass)~chlorophyll, data= plant_subset)
model_2 <- lm(log(totmass)~amass, data = plant_subset)
model_3 <- lm(log(totmass)~species, data = plant_subset)
#Checking assumptions
plot(model_1)
check_normality(model_1)
## Warning: Non-normality of residuals detected (p = 0.002).
check_heteroscedasticity(model_1)
## OK: Error variance appears to be homoscedastic (p = 0.546).
plot(model_2)
check_normality(model_2)
## Warning: Non-normality of residuals detected (p = 0.002).
check_heteroscedasticity(model_2)
## OK: Error variance appears to be homoscedastic (p = 0.628).
plot(model_3)
check_normality(model_3)
## OK: residuals appear as normally distributed (p = 0.374).
check_heteroscedasticity(model_3)
## OK: Error variance appears to be homoscedastic (p = 0.100).
#check
vif(full_l)
## GVIF Df GVIF^(1/(2*Df))
## species 42.351675 9 1.231351
## feedlevel 1.621993 1 1.273575
## sla 1.999989 1 1.414210
## chlorophyll 1.949828 1 1.396362
## amass 2.872084 1 1.694722
## num_lvs 2.813855 1 1.677455
## num_phylls 2.995510 1 1.730754
#comparing models
AICc(full_l)
## [1] 133.9424
AICc(model_1)
## [1] 307.368
AICc(model_2)
## [1] 308.0364
AICc(model_3)
## [1] 157.5751
AICc(null_l)
## [1] 306.0028
MuMIn::AICc(full_l, model_1, model_2, model_3, null_l)
## df AICc
## full_l 17 133.9424
## model_1 3 307.3680
## model_2 3 308.0364
## model_3 11 157.5751
## null_l 2 306.0028
MuMIn::model.sel(full_l, model_1, model_2, model_3, null_l)
## Model selection table
## (Int) ams chl fdl num_lvs num_phy sla spc df
## full_l -1.3390 0.002338 0.004368 -0.4743 0.09176 -0.03959 -0.002493 + 17
## model_3 0.8836 + 11
## null_l 1.3500 2
## model_1 0.4650 0.001877 3
## model_2 1.3970 -0.001333 3
## logLik AICc delta weight
## full_l -46.371 133.9 0.00 1
## model_3 -66.337 157.6 23.63 0
## null_l -150.941 306.0 172.06 0
## model_1 -150.563 307.4 173.43 0
## model_2 -150.897 308.0 174.09 0
## Models ranked by AICc(x)
summary(full_l)
##
## Call:
## lm(formula = log(totmass) ~ species + feedlevel + sla + chlorophyll +
## amass + num_lvs + num_phylls, data = plant_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88872 -0.20811 0.02825 0.24218 0.78287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.339043 0.597727 -2.240 0.027624 *
## speciesalata 1.113163 0.184021 6.049 3.56e-08 ***
## speciesflava 1.404562 0.262955 5.341 7.29e-07 ***
## speciesjonesii 0.319652 0.196426 1.627 0.107281
## speciesleucophylla 1.709035 0.227608 7.509 4.88e-11 ***
## speciesminor 0.389310 0.187903 2.072 0.041239 *
## speciespsittacina -1.645198 0.207035 -7.946 6.36e-12 ***
## speciespurpurea -0.364348 0.254380 -1.432 0.155643
## speciesrosea -0.947383 0.260495 -3.637 0.000467 ***
## speciesrubra 0.875342 0.196361 4.458 2.46e-05 ***
## feedlevel -0.474255 0.234493 -2.022 0.046199 *
## sla -0.002493 0.001160 -2.149 0.034430 *
## chlorophyll 0.004368 0.001189 3.672 0.000414 ***
## amass 0.002338 0.002988 0.782 0.436166
## num_lvs 0.091764 0.022413 4.094 9.46e-05 ***
## num_phylls -0.039585 0.051714 -0.765 0.446068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.413 on 87 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8461
## F-statistic: 38.38 on 15 and 87 DF, p-value: < 2.2e-16
tidy(full_l, conf.int = TRUE, conf.level = .95)
## # A tibble: 16 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.34 0.598 -2.24 2.76e- 2 -2.53 -0.151
## 2 speciesalata 1.11 0.184 6.05 3.56e- 8 0.747 1.48
## 3 speciesflava 1.40 0.263 5.34 7.29e- 7 0.882 1.93
## 4 speciesjonesii 0.320 0.196 1.63 1.07e- 1 -0.0708 0.710
## 5 speciesleucophylla 1.71 0.228 7.51 4.88e-11 1.26 2.16
## 6 speciesminor 0.389 0.188 2.07 4.12e- 2 0.0158 0.763
## 7 speciespsittacina -1.65 0.207 -7.95 6.36e-12 -2.06 -1.23
## 8 speciespurpurea -0.364 0.254 -1.43 1.56e- 1 -0.870 0.141
## 9 speciesrosea -0.947 0.260 -3.64 4.67e- 4 -1.47 -0.430
## 10 speciesrubra 0.875 0.196 4.46 2.46e- 5 0.485 1.27
## 11 feedlevel -0.474 0.234 -2.02 4.62e- 2 -0.940 -0.00818
## 12 sla -0.00249 0.00116 -2.15 3.44e- 2 -0.00480 -0.000187
## 13 chlorophyll 0.00437 0.00119 3.67 4.14e- 4 0.00200 0.00673
## 14 amass 0.00234 0.00299 0.782 4.36e- 1 -0.00360 0.00828
## 15 num_lvs 0.0918 0.0224 4.09 9.46e- 5 0.0472 0.136
## 16 num_phylls -0.0396 0.0517 -0.765 4.46e- 1 -0.142 0.0632
#prediction
prediction <- ggpredict(full_l, terms= "num_lvs", back.transform = TRUE)
summary(null_l)
##
## Call:
## lm(formula = log(totmass) ~ 1, data = plant_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5602 -0.7848 0.1225 0.8320 1.7374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3495 0.1037 13.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.053 on 102 degrees of freedom
#plot
plot(prediction, add.data=TRUE)
Data collection summary:
The data was collected from a CSV file named “hf109-01-sarracenia.csv.” The data includes variables such as total mass, species, feed level, specific leaf area (SLA), chlorophyll content, aerial mass, number of leaves, and number of phyllodes. Description of data organization/processing:
The data was read from the CSV file and stored in the ‘plant’ variable. The column names were cleaned for clarity. A subset of columns of interest was selected for further analysis. Discussion of missing observations:
A visualization of missing data was created using the gg_miss_var() function. Missing values in the variables related to SLA, chlorophyll, aerial mass, number of leaves, and number of phyllodes were dropped using drop_na() function to create the ‘plant_subset’ object. Discussion of Pearson correlation:
Pearson’s correlation coefficient was calculated for the variables in the ‘plant_subset’ using the cor() function. The correlation coefficients were visualized using a correlation plot created with corrplot() function. Discussion of relationships between variables:
The pairs plot was created using the ggpairs() function to visualize relationships between variables in the ‘plant_subset’ dataset. Null model:
The null model was created using the formula totmass ~ 1 in the lm() function. It represents the simplest model without any predictors. Full model:
The full model was created using multiple predictors: species, feed level, SLA, chlorophyll content, aerial mass, number of leaves, and number of phyllodes. The formula totmass ~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls was used in the lm() function. Discussion of null and full model:
The null model provides a baseline against which the full model can be compared to determine if the predictors significantly improve the model’s performance. By comparing the summaries and diagnostic plots of the null and full models, the impact of the predictors on the model can be assessed. Explanation of predictors for the second test model:
The second test model includes a single predictor, chlorophyll content, and the formula log(totmass) ~ chlorophyll was used. This model aims to investigate the relationship between chlorophyll content and log-transformed total mass while controlling for other variables. Explanation of predictors for the third test model:
The third test model includes a single predictor, species, and the formula log(totmass) ~ species was used. This model examines the effect of different species on log-transformed total mass while accounting for other variables. Model comparison:
The AICc values were calculated for the full model, second test model (chlorophyll predictor), third test model (species predictor), and the null model. The models were compared using the AICc() function and the model.sel() function from the MuMIn package to determine the best-fitting model based on AICc values.
The best model chosen for the analysis is the “full_l” model, which incorporates log-transformed total mass as the response variable and multiple predictors, including species, feed level, specific leaf area (SLA), chlorophyll content, aerial mass, number of leaves, and number of phyllodes. The selection of the best model was based on the AICc values, where the model with the lowest AICc value was deemed the most suitable, striking a balance between goodness of fit and model complexity. However, it is essential to assess whether the best model adheres to certain assumptions, such as linearity, independence, homoscedasticity, and normality of residuals, which were evaluated using diagnostic plots. The conformity of the model to these assumptions is crucial, as deviations might affect the validity of the model’s results and interpretations.
The summary of the “full_l” model provides valuable information regarding the coefficients, standard errors, t-values, and p-values associated with each predictor. Moreover, it includes an estimated intercept, representing the expected log-transformed total mass when all predictors are set to zero. Interpretation of the model becomes possible through the examination of these coefficients, allowing us to understand the direction and magnitude of the relationships between each predictor and the log-transformed total mass. Positive coefficients suggest that an increase in a specific predictor corresponds to an increase in log-transformed total mass, while negative coefficients indicate the opposite effect.